Qian Zhang, qz2416
# python version
!python --version
# install packages
!pip install nltk
!pip install vaderSentiment
!pip install twython
!pip install rake_nltk
!pip install wordcloud
# import libraries or download packages
import nltk
nltk.download("vader_lexicon")
nltk.download('stopwords')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from rake_nltk import Rake
from collections import Counter
from wordcloud import WordCloud, STOPWORDS
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
Last term, while writing a report for a course project, I needed papers that supported my statements. I had to spend a lot of time reading papers, which was extremely time-consuming, and I believe I will face the same situation again in the future. I wonder how to make this process of checking content and attitudes more efficient.
First, I come up with a scientific question:
The ideal answer to this question is to read all the materials I can find to learn the authors' opinions, and then choose those that support my arguments.
However, this is unrealistic when there are thousands of papers or textbooks.
Now we are given the philosophy dataset.
# read csv file
df = pd.read_csv(
    r'/Users/apple/Fall2021-Project1-QianZhang-Erica/data/philosophy_data.csv')
Let's take a look at the data:
# check for null values
df.info()
# display part of the data
df[0:10]
It contains 360,808 sentences from over 50 texts spanning 13 schools of philosophy. According to the output, there are no null values.
Given this data, I propose a method to analyze it and answer the scientific question. The current dataframe is already in the form my method needs, so no preprocessing is required.
Sentiment analysis uses natural language processing techniques to detect positive, negative or neutral attitude in a text.
Keyword extraction is a technique that extracts the most used and most important words in a text.
In this project, I use the vaderSentiment package for sentiment analysis, since it works on unlabelled text data (the sentences are not labelled as 'positive', 'negative' or 'neutral').
For keyword extraction, the rake_nltk package is used since it is more flexible: it can extract multi-word keywords, for example "data science" or "advanced data science".
In the method, I treat these sentences by schools.
First, I will run sentiment analysis on every sentence of one school to see whether the attitude is positive, negative or neutral, and at the same time run keyword extraction on each sentence. As a result, I obtain the percentages of positive, negative and neutral attitudes, together with the keywords for each attitude, in that particular school. Then I repeat this process for every school. In this way, I can check which school has texts related to my topic and whether those texts support my arguments.
# set plot size
plt.rcParams['figure.figsize'] = [14, 12]
# Create stopword list:
stopw = set(STOPWORDS)
# decide threshold of attitudes
def classify_sentiment(vds_res):
    if vds_res['compound'] >= 0.05:
        return "pos"
    elif vds_res['compound'] <= -0.05:
        return "neg"
    else:
        return "neu"
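The ±0.05 cutoffs are VADER's commonly recommended thresholds on the compound score. A quick standalone check of the same thresholds (the function is restated here so the snippet runs on its own):

```python
# same ±0.05 thresholds as classify_sentiment above
def classify(compound):
    if compound >= 0.05:
        return "pos"
    elif compound <= -0.05:
        return "neg"
    return "neu"

print(classify(0.62), classify(-0.30), classify(0.0))  # pos neg neu
```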
# count numbers of positive, negative and neutral, and extract keywords
def each_school(df_school):
    num_pos = 0
    num_neg = 0
    num_neu = 0
    kw_pos = []
    kw_neg = []
    kw_neu = []
    # classify each sentence (using the lower-cased 'sentence_lowered' column)
    # and collect its keywords under the matching attitude
    for i in range(len(df_school)):
        sentence = df_school['sentence_lowered'].iloc[i]
        res = classify_sentiment(vds.polarity_scores(sentence))
        rn.extract_keywords_from_text(sentence)
        if res == "pos":
            num_pos += 1
            kw_pos.extend(rn.get_ranked_phrases())
        elif res == "neg":
            num_neg += 1
            kw_neg.extend(rn.get_ranked_phrases())
        else:
            num_neu += 1
            kw_neu.extend(rn.get_ranked_phrases())
    return [num_pos, num_neg, num_neu], kw_pos, kw_neg, kw_neu
# count frequencies of each keyword
def count_freq(dict_per_school):
    freq_dict_pos = dict(Counter(dict_per_school[0]))
    freq_dict_neg = dict(Counter(dict_per_school[1]))
    freq_dict_neu = dict(Counter(dict_per_school[2]))
    return freq_dict_pos, freq_dict_neg, freq_dict_neu
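Counter simply tallies repeated phrases into a frequency mapping, which is the format WordCloud.generate_from_frequencies expects. A tiny example with made-up phrases:

```python
from collections import Counter

phrases = ["pure reason", "moral law", "pure reason"]
# Counter counts occurrences of each phrase in the list
freq = dict(Counter(phrases))
print(freq)  # {'pure reason': 2, 'moral law': 1}
```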
# draw pie chart
def draw_pie(school_name):
    plt.figure(figsize=(8, 8))
    plt.pie(df_plot_per.loc[school_name],
            autopct='%1.1f%%',
            labels=['Positive', 'Negative', 'Neutral'],
            explode=(0.02, 0.02, 0.02),
            colors=['#ff9999', '#66b3ff', '#99ff99'])
# draw word cloud
def draw_cloud(school_key_dict):
    po, ne, nu = count_freq(school_key_dict)
    for freq, title in [(po, 'Word Cloud - Positive'),
                        (ne, 'Word Cloud - Negative'),
                        (nu, 'Word Cloud - Neutral')]:
        wc = WordCloud(stopwords=stopw, collocations=False).generate_from_frequencies(freq)
        plt.imshow(wc, interpolation='bilinear')
        plt.axis("off")
        plt.title(title)
        plt.show()
# sentiment analysis
vds = SentimentIntensityAnalyzer()
# keyword extraction
rn = Rake()
# find names of those schools
school_list = df['school'].unique()
# generate results
res_dict = {}
key_dict = {}
for i in school_list:
    df_school = df[df['school'] == i]
    res = each_school(df_school)
    res_dict[i] = res[0]
    key_dict[i] = [res[1], res[2], res[3]]
# transform the results for plotting graphs
df_plot = pd.DataFrame.from_dict(
res_dict,orient='index', columns = ['Positive', 'Negative', 'Neutral'])
df_plot.index.name = 'School'
df_plot.reset_index(inplace=True)
df_plot_t = df_plot["Positive"] + df_plot["Negative"] + df_plot["Neutral"]
df_plot_per = df_plot[df_plot.columns[1:]].div(df_plot_t, axis=0) * 100
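The `.div` call with axis 0 divides each row's counts by that school's total, turning counts into percentages. A small sketch with made-up counts (the school names here are placeholders):

```python
import pandas as pd

counts = pd.DataFrame({'Positive': [6, 2], 'Negative': [3, 2], 'Neutral': [1, 6]},
                      index=['school_a', 'school_b'])
totals = counts.sum(axis=1)                 # row totals: 10 and 10
percent = counts.div(totals, axis=0) * 100  # each row now sums to 100
print(percent.loc['school_a', 'Positive'])  # 60.0
```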
# histogram
fig, ax = plt.subplots()
plt.bar(school_list, df_plot_t)
plt.title('Total number of sentences in each school')
fig.autofmt_xdate()
This plot shows the total number of sentences in each school. We can see that stoicism has a relatively small number of sentences.
# Normalized Stacked Barplot
fig, ax = plt.subplots(figsize=(12, 12))
names = school_list
positive = df_plot_per['Positive']
negative = df_plot_per['Negative']
neutral = df_plot_per['Neutral']
# stack bars
plt.bar(names, positive, label='positive')
plt.bar(names, negative, bottom=positive,label='negative')
plt.bar(names, neutral, bottom=positive+negative, label='neutral')
# add percentage
for xs, ys, yval in zip(names, positive/2, positive):
    plt.text(xs, ys, "%.1f" % yval, ha="center", va="center")
for xs, ys, yval in zip(names, positive + negative/2, negative):
    plt.text(xs, ys, "%.1f" % yval, ha="center", va="center")
for xs, ys, yval in zip(names, positive + negative + neutral/2, neutral):
    plt.text(xs, ys, "%.1f" % yval, ha="center", va="center")
# add total
for xs, ys, yval in zip(names, positive + negative + neutral, df_plot_t):
    plt.text(xs, ys, yval, ha="center", va="bottom")
plt.title("Normalized Stacked Barplot")
fig.autofmt_xdate()
plt.legend(bbox_to_anchor=(1.01,0.5), loc='center left')
This plot shows the percentage of positive, negative and neutral attitudes for the sentences in each school. The number above each bar is the total number of sentences in that school.
This plot provides a lot of information. For example, capitalism has the highest percentage of positive attitudes; feminism has the highest percentage of negative attitudes; phenomenology has the highest percentage of neutral attitudes. Many comparisons can be made here.
# transform data for plotting
df_plot_per.index = school_list
draw_pie('plato')
draw_cloud(key_dict['plato'])
draw_pie('aristotle')
draw_cloud(key_dict['aristotle'])
draw_pie('empiricism')
draw_cloud(key_dict['empiricism'])
draw_pie('rationalism')
draw_cloud(key_dict['rationalism'])
draw_pie('analytic')
draw_cloud(key_dict['analytic'])
draw_pie('continental')
draw_cloud(key_dict['continental'])
draw_pie('phenomenology')
draw_cloud(key_dict['phenomenology'])
draw_pie('german_idealism')
draw_cloud(key_dict['german_idealism'])
draw_pie('communism')
draw_cloud(key_dict['communism'])
draw_pie('capitalism')
draw_cloud(key_dict['capitalism'])
draw_pie('stoicism')
draw_cloud(key_dict['stoicism'])
draw_pie('nietzsche')
draw_cloud(key_dict['nietzsche'])
draw_pie('feminism')
draw_cloud(key_dict['feminism'])
By looking at the pie chart and the word cloud for each school, I can see each school's attitude towards several topics. These results help answer the scientific question: which materials talk about my topic and support my opinion. For example, if I am looking for texts related to capitalism, I should focus on the school Capitalism, since related words do not appear in any other school's word cloud. The charts then show that positive arguments are more likely to be found there (56.1%). If I support capitalism, those texts are worth reading; if I am against capitalism, the percentage of negative attitudes is only 19.7%, so it will be harder to find relevant texts and I might need to look at other kinds of materials besides these texts. Personally, the scientific question does help me decide which materials to choose for my report.
(1) The total number of sentences per school varies a lot (stoicism has relatively few sentences), and this might affect the results.
(2) The word cloud graphs for each school look a little strange, as words such as "one", "say" and "would" appear on them. Those words carry no information for me, but I kept the default stopword list since I am not familiar with philosophy and do not know whether these words matter in that field. The stopword list might need to be extended.